Hello and welcome to this documentation, where I use embeddings to guide decision-making for our risk polarization study. In this project, I want to get a feel for which risks to include in our study. This small project was done within the context of a large language model seminar and will therefore be rough around the edges.
Our study asks participants to rate certain risks by their perceived riskiness. The “Risk Group” decided to use a more systematic approach to filter out risks worth investigating. Using several sources, such as the Basel Risk Norms and two separate scoping reviews (one from our Risk Polarization group, the other from Amanda), they listed around 100 risks and labeled them by domain (health, finances, political, crime, and nature). To be as efficient as possible, however, we have to choose risks that are worth asking about in the first place, as some risks are more similar to each other than others.
One way to do this is to ask humans to rate or sort the risks into clusters/domains themselves. But as this takes time and money, this project tries to leverage embeddings to do the clustering and mapping. Huge shout-out to the R package embedR and its author Dirk Wulff: working with embeddings in R is made very easy with this package, in addition to the generous pipeline(s) provided on his GitHub page.
This small project aims to solve the problem of choosing the right risks in each domain without redundancies, that is, avoiding risk items within a domain that are too similar to each other. The second problem is identifying whether the different framings in which one can ask about the dimensions of a risk change how that risk is perceived.
We can solve the first problem with the embeddings: they give us a similarity matrix, which lets us map the risks onto a two-dimensional space. The second problem can be addressed by comparing the similarity matrices from the different dimensions with each other after the risks have been separated into clusters with an unsupervised machine learning method.
This is where I insert my personal Hugging Face token. For security reasons, it is not provided here.
er_set_tokens(huggingface = "YOUR PERSONAL TOKEN FROM HUGGINGFACE")
risk_dat <- read_xlsx("Risks.xlsx", range = "B1:F21")
#pivot longer (so each risk and its domain get their own columns)
##remove the superscript numbers attached to each risk word
risk_dat <- risk_dat %>%
pivot_longer(everything(), names_to = "Domain/Label", values_to = "Risk/Items") %>%
mutate(`Risk/Items` = str_replace_all(`Risk/Items`, "[¹²³⁴⁵⁶⁷⁸⁹]", ""))
#it seems like we only have 99 unique risks instead of 100. Also, some items contain two risks.
#I'll just keep the first one, since the risks within such an item are very similar to begin with.
unique(risk_dat$`Risk/Items`)
#motor vehicles appears twice.
risk_dat[duplicated(risk_dat$`Risk/Items`),2]
dupl <- risk_dat[duplicated(risk_dat$`Risk/Items`),]$`Risk/Items`
risk_dat[risk_dat$`Risk/Items` == dupl,]
#remove duplicates
risk_dat <- risk_dat[!duplicated(risk_dat$`Risk/Items`),]
#remove second risks in an item
risk_dat <- risk_dat %>%
mutate(`Risk/Items` = sub("/.*", "", `Risk/Items`), #drop the second risk after "/"
`Risk/Items` = sub("\\..*", "", `Risk/Items`), #drop anything after a period
`Risk/Items`= gsub("[\r\n]", "", `Risk/Items`)) %>% #strip stray line breaks
arrange(`Domain/Label`, `Risk/Items`)
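As a quick sanity check, here is how substitution patterns like the ones above behave on some hypothetical raw items (the strings are made up for illustration and are not from the actual data):

```r
# Hypothetical raw items: one lists two risks, one carries stray line breaks
items <- c("motor vehicles/airplanes", "floods\r\n")

items <- sub("/.*", "", items)      # keep only the first risk in an item
items <- gsub("[\r\n]", "", items)  # strip carriage returns and line feeds
items
#> [1] "motor vehicles" "floods"
```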
Here is a list of risks we are working with:
risk_dat %>%
arrange(`Domain/Label`) %>%
paged_table(options = list(rownames.print = F,
rows.print = 20))
We also want to contextualize the risks. We’ll use the paper by Wilson et al. (2018) as the reference for this.
In their article, one can take four different dimensions: there is a generic question asking how risky one thinks X is, and then there are three other ways (or dimensions) of asking about one’s risk perception: affect, probability, and consequences. So let’s do this for all contextualization methods. For the three dimensions, we take the question with the highest factor loading, so each dimension has exactly one question.
risk_dat <- risk_dat %>%
mutate(Risk_ID = 1:nrow(risk_dat))
dimension = c("generic", "affect", "probability", "consequences")
risk_grid <- expand_grid(Risks = unique(risk_dat$`Risk/Items`), dimension = dimension)
risk_dat_contextualized <- risk_dat %>%
full_join(risk_grid, by = c("Risk/Items" = "Risks")) %>%
mutate(`Risk/Items` = case_when(dimension == "generic" ~ paste0("How risky do you think ", `Risk/Items`, " is?"),
dimension == "affect" ~ paste0("When you think about ", `Risk/Items`, " for a moment, to what extent do you feel worried?"),
dimension == "probability" ~ paste0("How likely is it that ", `Risk/Items`, " will occur this year where you live?"),
dimension == "consequences" ~ paste0("If I did experience ", `Risk/Items`, " it would have a severe effect on me personally")
)) %>%
arrange(dimension, `Domain/Label`)
risk_dat_contextualized %>%
select(`Risk/Items`, everything()) %>%
paged_table(options = list(rownames.print = F,
rows.print = 20))
Not every risk item fits the question templates perfectly, but it should suffice for this small project. The Risk ID is here so we can match the risks again later and compare all dimensions of a risk (none, generic, affect, probability, or consequences).
For our LLM presentation, we want to include fewer risks so the plots
etc. are easier to understand, so we will handpick some of the
risks among the 99.
# selected_risks <- c("assassin", "bomb", "gun", "drugs", "global warming", "suicide", "vaccine", "war", "climate change", "alcohol", "fraud", "social risk", "ageing of population", "corruption", "propaganda", "environmental threats")
#
# risk_dat <- risk_dat %>%
# filter(`Risk/Items` %in% selected_risks)
#
# risk_dat_contextualized <- risk_dat_contextualized %>%
# filter(Risk_ID %in% risk_dat$Risk_ID)
The data has two columns: the names of the domains
("Domain/Label") and the risks ("Risk/Items"),
which are the targets of the embedding analysis. Using the
er_embed() function, we can embed the items. We will be
using the default all-mpnet-base-v2 model from Hugging
Face, a good lightweight model for embedding analyses.
risk_embed <- er_embed(text = risk_dat$`Risk/Items`,
api = "huggingface",
model = "sentence-transformers/all-mpnet-base-v2")
We can now inspect the embeddings for each risk:
risk_embed
##
## embedR object
##
## Embedding: 99 objects and 768 dimensions.
##
##
## Embedding
##
## [,1] [,2] [,3] [,4] [,5]
## assassin 0.018385978 0.04980664 -0.004807225 0.035280682 -0.013225548
## assassination 0.040638089 0.02382069 0.010378473 0.016700501 -0.040772166
## bomb -0.001085747 0.07159345 -0.014801372 0.031823762 -0.004992277
## crime 0.018048594 0.08018764 0.026342748 0.001579845 -0.075590685
## criminal -0.025894137 0.08807318 0.009723256 -0.009204310 -0.067338876
We will use multidimensional scaling to reduce the embeddings to two dimensions, which helps us visualize the result. The following plot is colored by “Team Risks’” domain labels. Looking at the plot, most items sit well within their respective labels, but there are still outliers.
risk_reduced_df <- risk_embed %>%
er_project() %>%
er_frame()
risk_table_embed <- risk_dat %>%
inner_join(risk_reduced_df, by = c("Risk/Items" = "text"))
risk_table_embed %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = `Domain/Label`))+
geom_point(size = 2)+
geom_text(nudge_y = .01, size = 5.5)+
theme_minimal()+
theme(legend.position = "top") +
labs(y = "Dimension 2",
x = "Dimension 1",
title = "Single Risk Items")
Instead of comparing each risk to the others visually, we can calculate the cosine similarity, so we also have a numerical representation of how similar they are to each other. This also helps us choose which risks to take: we don’t want ones that are too similar, as they would be redundant.
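For intuition, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal base-R sketch of the computation (with toy three-dimensional vectors, not the real 768-dimensional embeddings):

```r
# Cosine similarity: dot product normalized by the vectors' magnitudes
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

v1 <- c(0.2, 0.1, -0.4)
v2 <- c(0.4, 0.2, -0.8)  # same direction as v1, twice the magnitude
v3 <- c(-0.2, 0.5, 0.1)  # points elsewhere

cosine_sim(v1, v2)  # 1: identical direction, maximally similar
cosine_sim(v1, v3)  # close to 0: nearly unrelated directions
```

Because the measure only depends on direction, v2 is maximally similar to v1 even though it is twice as long.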
risk_cosine <- risk_embed %>%
er_compare_vectors(metric = "cosine")
risk_cosine_clean <- crossing(item_i = rownames(risk_cosine),
item_j = colnames(risk_cosine)) %>%
mutate(cosine = risk_cosine[cbind(item_i, item_j)]) %>%
left_join(risk_table_embed %>% select(`Domain/Label`, `Risk/Items`),
by = c("item_i" = "Risk/Items")) %>%
left_join(risk_table_embed %>% select(`Domain/Label`, `Risk/Items`),
by = c("item_j" = "Risk/Items"), suffix = c("_i", "_j"))
Let’s do the same for the contextualized risks!
risk_contextualized_embed <- er_embed(text = risk_dat_contextualized$`Risk/Items`,
api = "huggingface",
model = "sentence-transformers/all-mpnet-base-v2")
risk_reduced_contextualized_df <- risk_contextualized_embed %>%
er_project() %>%
er_frame()
risk_table_contextualized_embed <- risk_dat_contextualized %>%
inner_join(risk_reduced_contextualized_df, by = c("Risk/Items" = "text")) %>%
inner_join(risk_dat, by = "Risk_ID", suffix = c("_context", "_single_risk"))
risk_amount <- nrow(risk_dat)
for (i in 1:length(dimension)) {
dim <- dimension[i]
assign(x = paste(dim, "reduced_contextualized_df", sep = "_"),
value = risk_table_contextualized_embed[seq(from = 1 + risk_amount * (i -1), to = risk_amount * i),])
}
risk_table_contextualized_embed %>%
ggplot(aes(x = dim_1, y = dim_2, label = dimension, col = `Domain/Label_single_risk`)) +
geom_point(size = 2) +
geom_text(nudge_y = .01, size = 5.5) +
theme_minimal() +
theme(legend.position = "top") +
labs(y = "Dimension 2",
x = "Dimension 1",
title = "Mapping of different contextualized risks")
We can see from this plot that these different dimensions are in fact distinct from the perspective of the embeddings. Only the generic and affect framings land in more or less the same region (at least in these two dimensions).
For those who are interested, here is a separate visualization for each dimension. Be careful of the different axis scales, though! As described above, they do differ between dimensions.
generic_reduced_contextualized_df %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
geom_point(size = 2) +
geom_text(nudge_y = .01, size = 3) +
theme_minimal() +
theme(legend.position = "top") +
labs(y = "Dimension 2",
x = "Dimension 1",
title = "Generic Framing")
affect_reduced_contextualized_df %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
geom_point(size = 2) +
geom_text(nudge_y = .01, size = 3) +
theme_minimal() +
theme(legend.position = "top") +
labs(y = "Dimension 2",
x = "Dimension 1",
title = "Affect Framing")
probability_reduced_contextualized_df %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
geom_point(size = 2) +
geom_text(nudge_y = .01, size = 3) +
theme_minimal() +
theme(legend.position = "top") +
labs(y = "Dimension 2",
x = "Dimension 1",
title = "Probability Framing")
consequences_reduced_contextualized_df %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items_single_risk`, col = `Domain/Label_single_risk`)) +
geom_point(size = 2) +
geom_text(nudge_y = .01, size = 3) +
theme_minimal() +
theme(legend.position = "top") +
labs(y = "Dimension 2",
x = "Dimension 1",
title = "Consequence Framing")
At first glance, our labels still hold up well even in the contextualized perspective.
However, we also want to investigate whether the risks are stable across these different dimensions! For this, we can compare each risk with every other risk using the cosine similarity computed from the embeddings.
risk_cosine_contextualized <- risk_contextualized_embed %>%
er_compare_vectors(metric = "cosine")
risk_cosine_contextualized_clean <- crossing(item_i = rownames(risk_cosine_contextualized),
item_j = colnames(risk_cosine_contextualized)) %>%
mutate(cosine = risk_cosine_contextualized[cbind(item_i, item_j)]) %>%
left_join(risk_table_contextualized_embed %>% select(`Domain/Label_context`, `Risk/Items_context`, `Risk/Items_single_risk`, dimension),
by = c("item_i" = "Risk/Items_context")) %>%
left_join(risk_table_contextualized_embed %>% select(`Domain/Label_context`, `Risk/Items_context`, `Risk/Items_single_risk`, dimension),
by = c("item_j" = "Risk/Items_context"), suffix = c("_i", "_j")) %>%
filter(item_j != item_i)
base_cosine_ID <- risk_cosine_clean %>%
filter(item_i != item_j) %>%
left_join(risk_dat, by = c("item_i" = "Risk/Items")) %>%
left_join(risk_dat %>%
select(`Risk/Items`, Risk_ID), by = c("item_j" = "Risk/Items"), suffix = c("_i", "_j")) %>%
mutate(Risk_ID_i_j = paste(Risk_ID_i, Risk_ID_j)) %>%
select(cosine, Risk_ID_i_j)
context_cosine_ID <- risk_cosine_contextualized_clean %>%
select(cosine, `Risk/Items_single_risk_i`, `Risk/Items_single_risk_j`, dimension_i, dimension_j) %>%
left_join(risk_dat, by = c("Risk/Items_single_risk_i" = "Risk/Items")) %>%
left_join(risk_dat %>%
select(`Risk/Items`, Risk_ID), by = c("Risk/Items_single_risk_j" = "Risk/Items"), suffix = c("_i", "_j")) %>%
select(cosine, Risk_ID_i, Risk_ID_j, dimension_i, dimension_j) %>%
filter(dimension_i == dimension_j) %>%
mutate(Risk_ID_i_j = paste(Risk_ID_i, Risk_ID_j),
dimension = dimension_i) %>%
select(cosine, Risk_ID_i_j, dimension, Risk_ID_i)
joined_cosine <- context_cosine_ID %>%
pivot_wider(names_from = dimension,
values_from = cosine,
names_prefix = "cosine_") %>%
left_join(base_cosine_ID, by = "Risk_ID_i_j")
cor_mat_comparison <- joined_cosine %>%
summarize(
base_gen = cor(cosine, cosine_generic),
base_aff = cor(cosine, cosine_affect),
base_prob = cor(cosine, cosine_probability),
base_cons = cor(cosine, cosine_consequences),
gen_aff = cor(cosine_generic, cosine_affect),
gen_prob = cor(cosine_generic, cosine_probability),
gen_cons = cor(cosine_generic, cosine_consequences),
affect_prob = cor(cosine_affect, cosine_probability),
affect_cons = cor(cosine_affect, cosine_consequences),
prob_cons = cor(cosine_probability, cosine_consequences)
)
cor_mat_comparison %>%
t() %>%
as.data.frame() %>%
kable(col.names = "Correlations", digits = 3)
| Correlations | |
|---|---|
| base_gen | 0.754 |
| base_aff | 0.728 |
| base_prob | 0.638 |
| base_cons | 0.682 |
| gen_aff | 0.874 |
| gen_prob | 0.822 |
| gen_cons | 0.837 |
| affect_prob | 0.840 |
| affect_cons | 0.835 |
| prob_cons | 0.781 |
While these pairwise correlations are moderate to strong across the different dimensions/framing contexts (meaning that the similarity between risks is not greatly influenced by the context a risk appears in), they are averaged over the whole risk set. Let’s plot some more and see whether there are strong outliers!
joined_cosine %>%
select(cosine_generic, contains("cosine")) %>%
ggpairs(progress = FALSE,
axisLabels = "show",
columnLabels = c("Generic", "Probability", "Consequences", "Affect", "Single"))+
theme_minimal()+
theme(strip.text = element_text(size = 14))
These correlations were computed within each dimension setting, but what about the individual risks? We want to use risk items that are stable across settings!
We can therefore group by the risk itself and calculate those correlations again (i.e., for each risk, correlate its cosine-similarity vectors from two settings). A correlation of 1 means that the distances/cosine similarities stayed the same regardless of the dimension setting.
I then calculated a mean score from the correlations over all pairwise context settings. A higher score means that a risk item had consistent distances to the other risks across all pairwise settings.
cor_mat_per_risk <- joined_cosine %>%
group_by(Risk_ID_i) %>%
summarize(
base_gen = cor(cosine, cosine_generic),
base_aff = cor(cosine, cosine_affect),
base_prob = cor(cosine, cosine_probability),
base_cons = cor(cosine, cosine_consequences),
gen_aff = cor(cosine_generic, cosine_affect),
gen_prob = cor(cosine_generic, cosine_probability),
gen_cons = cor(cosine_generic, cosine_consequences),
affect_prob = cor(cosine_affect, cosine_probability),
affect_cons = cor(cosine_affect, cosine_consequences),
prob_cons = cor(cosine_probability, cosine_consequences)
) %>%
rowwise() %>%
mutate(mean_score = mean(c(base_gen,
base_aff,
base_prob,
base_cons,
gen_aff,
gen_prob,
gen_cons,
affect_prob,
affect_cons,
prob_cons))) %>%
arrange(desc(mean_score)) %>%
left_join(risk_dat, by = c("Risk_ID_i" = "Risk_ID"))
cor_mat_per_risk %>%
select(- Risk_ID_i) %>%
relocate(`Risk/Items`, `Domain/Label`) %>%
kable(digits = 3)
| Risk/Items | Domain/Label | base_gen | base_aff | base_prob | base_cons | gen_aff | gen_prob | gen_cons | affect_prob | affect_cons | prob_cons | mean_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| greenhouse gases | Nature | 0.922 | 0.932 | 0.879 | 0.917 | 0.956 | 0.923 | 0.932 | 0.908 | 0.949 | 0.914 | 0.923 |
| political parties | Political | 0.905 | 0.912 | 0.892 | 0.887 | 0.948 | 0.953 | 0.933 | 0.940 | 0.914 | 0.925 | 0.921 |
| political preference | Political | 0.905 | 0.919 | 0.894 | 0.892 | 0.939 | 0.957 | 0.929 | 0.936 | 0.922 | 0.913 | 0.921 |
| political partisanship | Political | 0.908 | 0.920 | 0.884 | 0.895 | 0.952 | 0.942 | 0.934 | 0.932 | 0.921 | 0.904 | 0.919 |
| political inclinations | Political | 0.917 | 0.922 | 0.869 | 0.899 | 0.946 | 0.939 | 0.942 | 0.922 | 0.927 | 0.887 | 0.917 |
| emissions | Nature | 0.892 | 0.862 | 0.860 | 0.860 | 0.947 | 0.920 | 0.926 | 0.911 | 0.909 | 0.892 | 0.898 |
| political ideology | Political | 0.870 | 0.879 | 0.840 | 0.834 | 0.939 | 0.930 | 0.936 | 0.918 | 0.913 | 0.892 | 0.895 |
| coal power | Nature | 0.887 | 0.876 | 0.851 | 0.850 | 0.944 | 0.902 | 0.898 | 0.908 | 0.924 | 0.899 | 0.894 |
| homicide | Crime | 0.875 | 0.868 | 0.843 | 0.836 | 0.926 | 0.905 | 0.911 | 0.951 | 0.926 | 0.881 | 0.892 |
| trust government | Political | 0.893 | 0.902 | 0.849 | 0.886 | 0.933 | 0.868 | 0.905 | 0.909 | 0.895 | 0.865 | 0.891 |
| natural gas | Nature | 0.891 | 0.858 | 0.829 | 0.838 | 0.952 | 0.892 | 0.917 | 0.899 | 0.920 | 0.904 | 0.890 |
| political position | Political | 0.889 | 0.872 | 0.844 | 0.829 | 0.929 | 0.953 | 0.911 | 0.919 | 0.875 | 0.879 | 0.890 |
| murder | Crime | 0.868 | 0.843 | 0.810 | 0.816 | 0.934 | 0.894 | 0.921 | 0.921 | 0.925 | 0.873 | 0.881 |
| influenza | Health | 0.873 | 0.870 | 0.770 | 0.877 | 0.924 | 0.896 | 0.910 | 0.884 | 0.928 | 0.872 | 0.880 |
| party identification | Political | 0.870 | 0.869 | 0.865 | 0.796 | 0.924 | 0.928 | 0.866 | 0.904 | 0.852 | 0.848 | 0.872 |
| zika virus | Health | 0.880 | 0.870 | 0.840 | 0.854 | 0.875 | 0.904 | 0.868 | 0.877 | 0.877 | 0.871 | 0.872 |
| nuclear power | Nature | 0.867 | 0.839 | 0.830 | 0.830 | 0.920 | 0.881 | 0.885 | 0.899 | 0.890 | 0.863 | 0.870 |
| climate change | Nature | 0.874 | 0.864 | 0.752 | 0.810 | 0.946 | 0.886 | 0.906 | 0.880 | 0.919 | 0.861 | 0.870 |
| murderer | Crime | 0.850 | 0.841 | 0.802 | 0.810 | 0.924 | 0.921 | 0.880 | 0.932 | 0.863 | 0.850 | 0.867 |
| tsunami | Nature | 0.850 | 0.830 | 0.764 | 0.831 | 0.919 | 0.897 | 0.893 | 0.902 | 0.921 | 0.861 | 0.867 |
| global warming | Nature | 0.843 | 0.818 | 0.735 | 0.773 | 0.952 | 0.911 | 0.926 | 0.894 | 0.936 | 0.876 | 0.866 |
| cyclones | Nature | 0.856 | 0.821 | 0.746 | 0.850 | 0.933 | 0.862 | 0.934 | 0.864 | 0.942 | 0.848 | 0.866 |
| liberal | Political | 0.831 | 0.829 | 0.795 | 0.763 | 0.929 | 0.904 | 0.906 | 0.922 | 0.884 | 0.886 | 0.865 |
| earthquakes | Nature | 0.860 | 0.808 | 0.771 | 0.829 | 0.885 | 0.876 | 0.924 | 0.878 | 0.924 | 0.885 | 0.864 |
| politician | Political | 0.834 | 0.818 | 0.777 | 0.786 | 0.927 | 0.930 | 0.895 | 0.913 | 0.861 | 0.901 | 0.864 |
| environmental threats | Nature | 0.848 | 0.851 | 0.768 | 0.858 | 0.915 | 0.856 | 0.889 | 0.861 | 0.912 | 0.867 | 0.862 |
| criminal | Crime | 0.841 | 0.851 | 0.768 | 0.739 | 0.931 | 0.890 | 0.901 | 0.940 | 0.889 | 0.846 | 0.859 |
| wildfires | Nature | 0.855 | 0.824 | 0.754 | 0.810 | 0.903 | 0.882 | 0.902 | 0.875 | 0.910 | 0.875 | 0.859 |
| economic crisis | Others (Social/Finance) | 0.835 | 0.848 | 0.767 | 0.817 | 0.920 | 0.874 | 0.895 | 0.859 | 0.926 | 0.848 | 0.859 |
| anarchy | Political | 0.817 | 0.772 | 0.764 | 0.761 | 0.935 | 0.913 | 0.917 | 0.912 | 0.891 | 0.901 | 0.858 |
| fascism | Political | 0.807 | 0.817 | 0.786 | 0.766 | 0.933 | 0.888 | 0.883 | 0.920 | 0.900 | 0.879 | 0.858 |
| floods | Nature | 0.846 | 0.811 | 0.737 | 0.824 | 0.913 | 0.868 | 0.906 | 0.875 | 0.931 | 0.861 | 0.857 |
| recession | Others (Social/Finance) | 0.818 | 0.822 | 0.728 | 0.787 | 0.913 | 0.884 | 0.902 | 0.885 | 0.939 | 0.878 | 0.856 |
| assassination | Crime | 0.856 | 0.805 | 0.780 | 0.809 | 0.931 | 0.922 | 0.862 | 0.930 | 0.848 | 0.802 | 0.855 |
| hurricanes | Nature | 0.824 | 0.770 | 0.721 | 0.796 | 0.911 | 0.880 | 0.922 | 0.880 | 0.931 | 0.859 | 0.849 |
| fracking | Nature | 0.826 | 0.789 | 0.765 | 0.775 | 0.912 | 0.889 | 0.907 | 0.879 | 0.888 | 0.863 | 0.849 |
| pollution | Nature | 0.779 | 0.791 | 0.709 | 0.747 | 0.934 | 0.881 | 0.915 | 0.891 | 0.950 | 0.897 | 0.849 |
| drought | Nature | 0.791 | 0.794 | 0.699 | 0.771 | 0.937 | 0.890 | 0.931 | 0.864 | 0.933 | 0.881 | 0.849 |
| crime | Crime | 0.820 | 0.815 | 0.754 | 0.762 | 0.909 | 0.885 | 0.881 | 0.933 | 0.869 | 0.842 | 0.847 |
| low level of criminal punishment | Crime | 0.823 | 0.840 | 0.776 | 0.797 | 0.918 | 0.852 | 0.872 | 0.899 | 0.860 | 0.824 | 0.846 |
| covid-19 | Health | 0.824 | 0.824 | 0.743 | 0.826 | 0.908 | 0.849 | 0.882 | 0.866 | 0.890 | 0.836 | 0.845 |
| nuclear bomb | Crime | 0.831 | 0.775 | 0.729 | 0.782 | 0.906 | 0.853 | 0.902 | 0.896 | 0.906 | 0.848 | 0.843 |
| introducing new product to the market | Others (Social/Finance) | 0.882 | 0.836 | 0.772 | 0.815 | 0.891 | 0.845 | 0.846 | 0.849 | 0.855 | 0.827 | 0.842 |
| dictator | Political | 0.770 | 0.774 | 0.722 | 0.727 | 0.931 | 0.909 | 0.883 | 0.939 | 0.881 | 0.878 | 0.842 |
| hiv | Health | 0.850 | 0.789 | 0.760 | 0.776 | 0.893 | 0.872 | 0.867 | 0.869 | 0.882 | 0.842 | 0.840 |
| gene technology | Health | 0.849 | 0.803 | 0.810 | 0.782 | 0.892 | 0.858 | 0.853 | 0.838 | 0.812 | 0.843 | 0.834 |
| diabetes | Health | 0.803 | 0.778 | 0.699 | 0.740 | 0.928 | 0.885 | 0.874 | 0.881 | 0.898 | 0.845 | 0.833 |
| health behaviour | Political | 0.799 | 0.805 | 0.757 | 0.755 | 0.936 | 0.843 | 0.858 | 0.875 | 0.892 | 0.778 | 0.830 |
| kidnapping | Crime | 0.853 | 0.815 | 0.714 | 0.716 | 0.908 | 0.862 | 0.844 | 0.898 | 0.864 | 0.804 | 0.828 |
| conservative | Political | 0.802 | 0.830 | 0.726 | 0.762 | 0.897 | 0.821 | 0.866 | 0.872 | 0.843 | 0.834 | 0.825 |
| terrorist | Crime | 0.804 | 0.766 | 0.665 | 0.771 | 0.886 | 0.842 | 0.887 | 0.895 | 0.879 | 0.845 | 0.824 |
| climate | Nature | 0.753 | 0.784 | 0.664 | 0.774 | 0.929 | 0.839 | 0.887 | 0.842 | 0.910 | 0.841 | 0.822 |
| vaccine | Health | 0.789 | 0.682 | 0.769 | 0.811 | 0.858 | 0.836 | 0.883 | 0.795 | 0.846 | 0.870 | 0.814 |
| decreasing turnout | Political | 0.792 | 0.819 | 0.762 | 0.785 | 0.896 | 0.884 | 0.759 | 0.843 | 0.832 | 0.753 | 0.812 |
| marketization | Others (Social/Finance) | 0.843 | 0.801 | 0.654 | 0.677 | 0.937 | 0.838 | 0.821 | 0.873 | 0.856 | 0.795 | 0.810 |
| assassin | Crime | 0.828 | 0.811 | 0.728 | 0.752 | 0.873 | 0.741 | 0.838 | 0.866 | 0.851 | 0.785 | 0.807 |
| cigarettes | Health | 0.805 | 0.776 | 0.690 | 0.750 | 0.913 | 0.847 | 0.850 | 0.843 | 0.827 | 0.771 | 0.807 |
| bomb | Crime | 0.794 | 0.697 | 0.690 | 0.768 | 0.876 | 0.842 | 0.851 | 0.889 | 0.818 | 0.772 | 0.800 |
| propaganda | Political | 0.749 | 0.756 | 0.667 | 0.719 | 0.897 | 0.816 | 0.877 | 0.881 | 0.849 | 0.775 | 0.799 |
| refugee | Others (Social/Finance) | 0.786 | 0.726 | 0.636 | 0.740 | 0.890 | 0.864 | 0.857 | 0.848 | 0.827 | 0.804 | 0.798 |
| electric power | Nature | 0.792 | 0.750 | 0.694 | 0.779 | 0.920 | 0.793 | 0.823 | 0.832 | 0.824 | 0.768 | 0.798 |
| killing | Crime | 0.796 | 0.759 | 0.678 | 0.709 | 0.896 | 0.841 | 0.847 | 0.873 | 0.820 | 0.738 | 0.796 |
| sun protection | Health | 0.774 | 0.696 | 0.646 | 0.771 | 0.839 | 0.816 | 0.868 | 0.799 | 0.854 | 0.839 | 0.790 |
| ageing of population | Others (Social/Finance) | 0.766 | 0.821 | 0.662 | 0.796 | 0.852 | 0.845 | 0.769 | 0.807 | 0.831 | 0.750 | 0.790 |
| genocide | Crime | 0.788 | 0.734 | 0.697 | 0.729 | 0.827 | 0.808 | 0.852 | 0.827 | 0.816 | 0.795 | 0.787 |
| holocaust | Crime | 0.746 | 0.746 | 0.711 | 0.694 | 0.866 | 0.821 | 0.811 | 0.845 | 0.846 | 0.767 | 0.785 |
| corruption | Political | 0.655 | 0.695 | 0.597 | 0.608 | 0.913 | 0.895 | 0.866 | 0.912 | 0.857 | 0.846 | 0.784 |
| suicide | Others (Social/Finance) | 0.810 | 0.745 | 0.670 | 0.696 | 0.900 | 0.784 | 0.855 | 0.818 | 0.837 | 0.723 | 0.784 |
| poison | Crime | 0.753 | 0.808 | 0.606 | 0.775 | 0.919 | 0.762 | 0.823 | 0.787 | 0.869 | 0.737 | 0.784 |
| terrorism | Crime | 0.724 | 0.687 | 0.647 | 0.673 | 0.884 | 0.828 | 0.843 | 0.870 | 0.854 | 0.823 | 0.783 |
| epidemic | Health | 0.785 | 0.742 | 0.647 | 0.773 | 0.860 | 0.829 | 0.844 | 0.736 | 0.819 | 0.798 | 0.783 |
| sexual assault | Crime | 0.782 | 0.800 | 0.715 | 0.659 | 0.869 | 0.798 | 0.823 | 0.860 | 0.808 | 0.690 | 0.780 |
| inflation | Others (Social/Finance) | 0.741 | 0.683 | 0.648 | 0.656 | 0.876 | 0.830 | 0.771 | 0.872 | 0.819 | 0.851 | 0.775 |
| drowning | Health | 0.766 | 0.731 | 0.618 | 0.701 | 0.855 | 0.805 | 0.866 | 0.794 | 0.837 | 0.772 | 0.775 |
| cancer | Health | 0.661 | 0.698 | 0.552 | 0.648 | 0.867 | 0.872 | 0.879 | 0.863 | 0.850 | 0.817 | 0.771 |
| drugs | Health | 0.737 | 0.729 | 0.587 | 0.668 | 0.917 | 0.814 | 0.811 | 0.829 | 0.853 | 0.730 | 0.767 |
| mental health | Health | 0.776 | 0.756 | 0.624 | 0.667 | 0.848 | 0.827 | 0.848 | 0.774 | 0.807 | 0.736 | 0.766 |
| revolution | Political | 0.642 | 0.568 | 0.577 | 0.585 | 0.912 | 0.898 | 0.856 | 0.885 | 0.815 | 0.892 | 0.763 |
| globalization | Others (Social/Finance) | 0.693 | 0.676 | 0.504 | 0.618 | 0.926 | 0.860 | 0.836 | 0.865 | 0.833 | 0.792 | 0.760 |
| dementia | Health | 0.718 | 0.724 | 0.614 | 0.586 | 0.881 | 0.864 | 0.801 | 0.840 | 0.815 | 0.749 | 0.759 |
| cannabis | Health | 0.762 | 0.742 | 0.571 | 0.701 | 0.914 | 0.789 | 0.815 | 0.751 | 0.851 | 0.673 | 0.757 |
| health | Health | 0.751 | 0.748 | 0.600 | 0.696 | 0.865 | 0.785 | 0.806 | 0.808 | 0.783 | 0.645 | 0.749 |
| consumerism | Others (Social/Finance) | 0.568 | 0.633 | 0.538 | 0.553 | 0.902 | 0.859 | 0.807 | 0.853 | 0.839 | 0.757 | 0.731 |
| speeding | Crime | 0.733 | 0.665 | 0.547 | 0.600 | 0.857 | 0.746 | 0.820 | 0.736 | 0.824 | 0.777 | 0.730 |
| skiing | Others (Social/Finance) | 0.761 | 0.658 | 0.663 | 0.668 | 0.811 | 0.703 | 0.749 | 0.711 | 0.798 | 0.741 | 0.726 |
| fraud | Political | 0.635 | 0.649 | 0.556 | 0.508 | 0.822 | 0.810 | 0.743 | 0.856 | 0.785 | 0.699 | 0.706 |
| war | Crime | 0.684 | 0.628 | 0.605 | 0.596 | 0.782 | 0.771 | 0.681 | 0.825 | 0.739 | 0.704 | 0.702 |
| pregnancy | Health | 0.674 | 0.653 | 0.510 | 0.584 | 0.778 | 0.790 | 0.856 | 0.640 | 0.723 | 0.707 | 0.692 |
| social media | Others (Social/Finance) | 0.662 | 0.556 | 0.513 | 0.568 | 0.853 | 0.774 | 0.756 | 0.820 | 0.662 | 0.735 | 0.690 |
| social risk | Others (Social/Finance) | 0.700 | 0.719 | 0.600 | 0.622 | 0.804 | 0.650 | 0.651 | 0.780 | 0.730 | 0.578 | 0.683 |
| stress | Health | 0.711 | 0.635 | 0.438 | 0.643 | 0.815 | 0.628 | 0.789 | 0.666 | 0.785 | 0.651 | 0.676 |
| alcohol | Health | 0.719 | 0.723 | 0.437 | 0.658 | 0.865 | 0.634 | 0.746 | 0.639 | 0.802 | 0.494 | 0.672 |
| unemployment | Others (Social/Finance) | 0.669 | 0.619 | 0.472 | 0.471 | 0.801 | 0.784 | 0.651 | 0.813 | 0.773 | 0.662 | 0.671 |
| motor vehicles | Others (Social/Finance) | 0.679 | 0.630 | 0.487 | 0.581 | 0.859 | 0.746 | 0.642 | 0.745 | 0.689 | 0.575 | 0.663 |
| farming | Nature | 0.616 | 0.481 | 0.545 | 0.498 | 0.853 | 0.765 | 0.721 | 0.740 | 0.659 | 0.582 | 0.646 |
| media | Others (Social/Finance) | 0.651 | 0.608 | 0.405 | 0.563 | 0.882 | 0.641 | 0.710 | 0.686 | 0.588 | 0.569 | 0.630 |
| changing values | Others (Social/Finance) | 0.543 | 0.582 | 0.557 | 0.476 | 0.690 | 0.677 | 0.781 | 0.633 | 0.412 | 0.672 | 0.602 |
| artificial intelligence | Others (Social/Finance) | 0.620 | 0.468 | 0.272 | 0.473 | 0.781 | 0.651 | 0.671 | 0.648 | 0.675 | 0.602 | 0.586 |
| isolation | Others (Social/Finance) | 0.612 | 0.550 | 0.058 | 0.452 | 0.837 | 0.246 | 0.734 | 0.328 | 0.741 | 0.339 | 0.490 |
Let us pick the top-performing risk (i.e., the risk whose distances
to the other risks stayed the most stable, independent of the setting).
In this case, it was greenhouse gases:
top_risk <- cor_mat_per_risk$Risk_ID_i[1]
joined_cosine %>%
filter(Risk_ID_i == top_risk) %>%
select(contains("cosine")) %>%
ggpairs(progress = FALSE,
axisLabels = "show",
columnLabels = c("Generic", "Probability", "Consequences", "Affect", "Single"))+
theme_minimal()+
theme(strip.text = element_text(size = 14))
The prior approach only compared how stable each risk’s distances to all the other risks were, relative to the same risk pairings in the other dimensions. While we do see that most risks keep stable distances to the other risks across the different settings, this was only in comparison with other risks, not for a risk on its own.
We will address this problem here by comparing each risk with itself across the different dimensions. For example, we will compare “war” in the generic setting with “war” in the affect setting. We take the 768 embedding dimensions of each risk and correlate them with their counterpart in the other settings.
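To illustrate the idea with toy numbers (a simulated 768-dimensional vector stands in for a real embedding; the perturbation model is an assumption for illustration only): a correlation near 1 means the framing barely moved the risk’s representation.

```r
set.seed(1)
# Simulated embedding of one risk in the single-word setting
emb_single <- rnorm(768)
# The same risk in another framing, modeled as a mildly perturbed copy
emb_affect <- emb_single + rnorm(768, sd = 0.3)

cor(emb_single, emb_affect)  # close to 1: representation barely changed
```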
large_embed <- rbind(risk_embed, risk_contextualized_embed)
pairw_cor_df <- data.frame()
for (i in 1:nrow(risk_dat)){
#pull the five embeddings of risk i (single + four framings) as columns
small_mat <-
large_embed[seq(from = i,
to = nrow(large_embed),
by = nrow(risk_dat)), ] %>%
t()
risk_word <- colnames(small_mat)[1]
colnames(small_mat) <- c("single", "affect", "consequences", "general", "probability")
#correlate the five 768-dimensional vectors with each other
small_mat <- small_mat %>%
cor() %>%
as.data.frame() %>%
rownames_to_column(var = "dim_1") %>%
pivot_longer(cols = 2:6, names_to = "dim_2", values_to = "correlation")
#keep each pair only once (lower triangle of the correlation matrix)
cor_long <- small_mat[small_mat$dim_1 > small_mat$dim_2, ]
cor_long$risk <- risk_word
pairw_cor_df <- rbind(pairw_cor_df, cor_long)
}
wide_pairw_cor <- pairw_cor_df %>%
pivot_wider(
id_cols = risk,
names_from = c(dim_1, dim_2),
values_from = correlation
)
wide_pairw_cor
## # A tibble: 99 × 11
## risk single_affect single_consequences single_general single_probability
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 assassin 0.591 0.510 0.614 0.428
## 2 assassin… 0.570 0.593 0.626 0.505
## 3 bomb 0.507 0.486 0.628 0.485
## 4 crime 0.563 0.519 0.596 0.505
## 5 criminal 0.527 0.504 0.546 0.416
## 6 genocide 0.607 0.649 0.655 0.554
## 7 holocaust 0.594 0.651 0.669 0.573
## 8 homicide 0.509 0.564 0.564 0.509
## 9 kidnappi… 0.615 0.568 0.650 0.521
## 10 killing 0.463 0.431 0.501 0.352
## # ℹ 89 more rows
## # ℹ 6 more variables: consequences_affect <dbl>, general_affect <dbl>,
## # general_consequences <dbl>, probability_affect <dbl>,
## # probability_consequences <dbl>, probability_general <dbl>
Again, we now have ten pairwise correlations for every risk. I’ll rank the risks by the mean of all these correlations, so we can see which risk was the most stable across dimensions.
wide_pairw_cor <- wide_pairw_cor %>%
rowwise() %>%
mutate(
cor_mean = mean(c(
single_affect,
single_consequences,
single_general,
single_probability,
consequences_affect,
general_affect,
general_consequences,
probability_affect,
probability_consequences,
probability_general
))
) %>%
arrange(desc(cor_mean))
wide_pairw_cor %>%
relocate(cor_mean, .after = risk) %>%
head(50) %>%
left_join(risk_dat, by = c("risk" = "Risk/Items")) %>%
select(risk,`Domain/Label`, cor_mean ) %>%
kable(digits = 3)
| risk | Domain/Label | cor_mean |
|---|---|---|
| decreasing turnout | Political | 0.734 |
| social risk | Others (Social/Finance) | 0.726 |
| economic crisis | Others (Social/Finance) | 0.715 |
| zika virus | Health | 0.711 |
| low level of criminal punishment | Crime | 0.708 |
| coal power | Nature | 0.699 |
| recession | Others (Social/Finance) | 0.696 |
| gene technology | Health | 0.694 |
| fracking | Nature | 0.693 |
| environmental threats | Nature | 0.689 |
| anarchy | Political | 0.689 |
| wildfires | Nature | 0.689 |
| ageing of population | Others (Social/Finance) | 0.686 |
| cyclones | Nature | 0.685 |
| drought | Nature | 0.683 |
| floods | Nature | 0.682 |
| trust government | Political | 0.678 |
| earthquakes | Nature | 0.674 |
| influenza | Health | 0.673 |
| tsunami | Nature | 0.672 |
| greenhouse gases | Nature | 0.670 |
| epidemic | Health | 0.670 |
| natural gas | Nature | 0.669 |
| fascism | Political | 0.667 |
| introducing new product to the market | Others (Social/Finance) | 0.666 |
| kidnapping | Crime | 0.664 |
| political partisanship | Political | 0.663 |
| emissions | Nature | 0.661 |
| nuclear power | Nature | 0.660 |
| corruption | Political | 0.659 |
| consumerism | Others (Social/Finance) | 0.658 |
| fraud | Political | 0.657 |
| health behaviour | Political | 0.654 |
| refugee | Others (Social/Finance) | 0.653 |
| dementia | Health | 0.650 |
| sun protection | Health | 0.649 |
| political preference | Political | 0.644 |
| speeding | Crime | 0.644 |
| holocaust | Crime | 0.643 |
| genocide | Crime | 0.642 |
| hurricanes | Nature | 0.642 |
| pollution | Nature | 0.641 |
| party identification | Political | 0.639 |
| revolution | Political | 0.638 |
| political inclinations | Political | 0.638 |
| poison | Crime | 0.637 |
| globalization | Others (Social/Finance) | 0.637 |
| vaccine | Health | 0.636 |
| skiing | Others (Social/Finance) | 0.636 |
| covid-19 | Health | 0.634 |
# top_50_list <- wide_pairw_cor %>%
#   relocate(cor_mean, .after = risk) %>%
#   head(50) %>%
#   left_join(risk_dat, by = c("risk" = "Risk/Items")) %>%
#   select(risk, `Domain/Label`, cor_mean)
#
# wb <- createWorkbook()   # createWorkbook() takes no data argument
# addWorksheet(wb, sheetName = "Top 50 Risks")
# writeDataTable(wb, "Top 50 Risks", x = top_50_list)
# saveWorkbook(wb, "risk_embedded.xlsx")
One could ask whether these five domains describe our risks accurately. Thanks to the embeddings, we now have a cosine similarity matrix, which makes k-means clustering possible. We use the single-risk data frame here, i.e. the risks without any context/question.
The following plots are generated with different numbers of clusters. As a rule of thumb, the higher the F-statistic (the ratio of between-cluster to within-cluster sum of squares, each scaled by its degrees of freedom), the better the clustering.
set.seed(123)
k <- 3
N <- dim(risk_cosine)[1]
cl_risk <- kmeans(risk_cosine, centers = k)
Fstat <- (cl_risk$betweenss / (k-1)) / (cl_risk$tot.withinss/ (N-k))
risk_table_embed %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = factor(cl_risk$cluster)))+
geom_point(size = 2)+
geom_text(nudge_y = .01, size = 4)+
theme_minimal()+
theme(legend.position = "top")+
ylab("Dimension 2")+
xlab("Dimension 1")+
labs(color = "Risk Clusters")+
annotate("text", x = 0, y= .22, label = paste0("F-value = ", round(Fstat,3)), size = 5, color = "black")
set.seed(123)
k <- 4
N <- dim(risk_cosine)[1]
cl_risk <- kmeans(risk_cosine, centers = k)
Fstat <- (cl_risk$betweenss / (k-1)) / (cl_risk$tot.withinss/ (N-k))
risk_table_embed %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = factor(cl_risk$cluster)))+
geom_point(size = 2)+
geom_text(nudge_y = .01, size = 4)+
theme_minimal()+
theme(legend.position = "top")+
ylab("Dimension 2")+
xlab("Dimension 1")+
labs(color = "Risk Clusters")+
annotate("text", x = 0, y= .22, label = paste0("F-value = ", round(Fstat,3)), size = 5, color = "black")
set.seed(123)
k <- 5
N <- dim(risk_cosine)[1]
cl_risk <- kmeans(risk_cosine, centers = k)
Fstat <- (cl_risk$betweenss / (k-1)) / (cl_risk$tot.withinss/ (N-k))
risk_table_embed %>%
ggplot(aes(dim_1, dim_2, label = `Risk/Items`, col = factor(cl_risk$cluster)))+
geom_point(size = 2)+
geom_text(nudge_y = .01, size = 4)+
theme_minimal()+
theme(legend.position = "top")+
ylab("Dimension 2")+
xlab("Dimension 1")+
labs(color = "Risk Clusters")+
annotate("text", x = 0, y= .22, label = paste0("F-value = ", round(Fstat,3)), size = 5, color = "black")
The F-values indicate that four clusters fit best!
Here are the risks, separated into the four clusters. Now the only thing left to do is to make sense of them…
set.seed(42)
cl_risk <- kmeans(risk_cosine, centers = 4)
risks <- colnames(risk_cosine)
cluster <- cl_risk$cluster
clustered_df <- data.frame(risks, cluster)
rownames(clustered_df) <- NULL
clustered_df_wide <- clustered_df %>%
pivot_wider(names_from = cluster,
names_prefix = "cluster_",
values_from = risks,
values_fn = list(risks = list))
cluster_list <- lapply(clustered_df_wide, unlist) # one character vector per cluster
max_n <- max(lengths(cluster_list))
# pad the shorter clusters with NA so all columns have equal length
cluster_dat <- as.data.frame(
  lapply(cluster_list, function(v) c(v, rep(NA, max_n - length(v))))
)
colnames(cluster_dat) <- paste0("cluster_", 1:4)
cluster_dat %>%
kable()
| cluster_1 | cluster_2 | cluster_3 | cluster_4 |
|---|---|---|---|
| gene technology | low level of criminal punishment | poison | assassin |
| sun protection | consumerism | speeding | assassination |
| climate | economic crisis | alcohol | bomb |
| climate change | globalization | cancer | crime |
| coal power | marketization | cannabis | criminal |
| cyclones | media | cigarettes | genocide |
| drought | recession | covid-19 | holocaust |
| earthquakes | refugee | dementia | homicide |
| electric power | social risk | diabetes | kidnapping |
| emissions | anarchy | drowning | killing |
| environmental threats | conservative | drugs | murder |
| farming | decreasing turnout | epidemic | murderer |
| floods | dictator | health | nuclear bomb |
| fracking | fascism | hiv | sexual assault |
| global warming | liberal | influenza | terrorism |
| greenhouse gases | party identification | mental health | terrorist |
| hurricanes | political ideology | pregnancy | war |
| natural gas | political inclinations | stress | suicide |
| nuclear power | political parties | vaccine | corruption |
| pollution | political partisanship | zika virus | fraud |
| tsunami | political position | isolation | NA |
| wildfires | political preference | unemployment | NA |
| ageing of population | politician | health behaviour | NA |
| artificial intelligence | propaganda | NA | NA |
| changing values | revolution | NA | NA |
| inflation | trust government | NA | NA |
| introducing new product to the market | NA | NA | NA |
| motor vehicles | NA | NA | NA |
| skiing | NA | NA | NA |
| social media | NA | NA | NA |
This smaller project was done to make an informed decision when choosing polarized risks that remain stable across different contexts (i.e. their meaning does not change depending on how the risk is framed or which question dimension it is embedded in). The result is a list of risks ranked from stable to unstable.
Likewise, the cluster analysis pointed towards four clusters, though we do not yet have good labels to group them by…
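To help label the clusters, one could cross-tabulate the k-means assignments against the original domain labels and see which domains dominate each cluster. A minimal sketch, assuming `clustered_df` from above and that `risk_dat` contains the `Risk/Items` and `Domain/Label` columns:

```r
# counts of original domain labels within each k-means cluster
clustered_df %>%
  left_join(risk_dat, by = c("risks" = "Risk/Items")) %>%
  count(cluster, `Domain/Label`) %>%
  pivot_wider(names_from = cluster, values_from = n, values_fill = 0)
```

A cluster dominated by one domain can simply inherit that domain's name; mixed clusters are where a manual label is needed.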
embedR (Wulff DU (2023). embedR: Embed and analyze text. R package version 0.1.0, https://CRAN.R-project.org/package=embedR.)
GGally (Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2024). GGally: Extension to ‘ggplot2’. R package version 2.2.1, https://CRAN.R-project.org/package=GGally.)
knitr (Xie Y (2023). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.45, https://yihui.org/knitr/.)
openxlsx (Schauberger P, Walker A (2023). openxlsx: Read, Write and Edit xlsx Files. R package version 4.2.5.2, https://CRAN.R-project.org/package=openxlsx.)
readxl (Wickham H, Bryan J (2023). readxl: Read Excel Files. R package version 1.4.3, https://CRAN.R-project.org/package=readxl.)
rmarkdown (Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). rmarkdown: Dynamic Documents for R. R package version 2.25, https://github.com/rstudio/rmarkdown.)
tidyverse (Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.)